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Abstract 

The geometric median covariation matrix is a robust multivariate indicator of dis¬ 
persion which can be extended without any difficulty to functional data. We define 
estimators, based on recursive algorithms, that can be simply updated at each new 
observation and are able to deal rapidly with large samples of high dimensional data 
without being obliged to store all the data in memory. Asymptotic convergence prop¬ 
erties of the recursive algorithms are studied under weak conditions. The computation 
of the principal components can also be performed online and this approach can be 
useful for online outlier detection. A simulation study clearly shows that this robust 
indicator is a competitive alternative to minimum covariance determinant when the 
dimension of the data is small and robust principal components analysis based on 
projection pursuit and spherical projections for high dimension data. An illustration 
on a large sample and high dimensional dataset consisting of individual TV audiences 
measured at a minute scale over a period of 24 hours confirms the interest of consider¬ 
ing the robust principal components analysis based on the median covariation matrix. 

All studied algorithms are available in the R package Gmedian on CRAN. 

Keywords. Averaging, Functional data, Geometric median, Online algorithms, Online 
principal components, Recursive robust estimation, Stochastic gradient, Weiszfeld’s algo¬ 
rithm. 


1 Introduction 


Principal Components Analysis is one of the most useful statistical tool to extract informa¬ 
tion by reducing the dimension when one has to analyze large samples of multivariate or 


functional data (see e.g. Jolliffe (2002) or Ramsay and Silverman (2005)). When both the 


dimension and the sample size are large, outlying observations may be difficult to detect 
automatically. Principal components, which are derived from the spectral analysis of the 
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We consider in this work another robust way of measuring association between vari¬ 
ables, that can be extended directly to functional data. It is based on the notion of median 
covariation matrix (MCM) which is defined as the minimizer of an expected loss criterion 
based on the Hilbert-Schmidt norm (see Kraus and Panaretos (2012) for a first definition in 


a more general M-estimation setting). It can be seen as a geometric median (see Kemper- 


man 


(1987) or Mottonen et al. (2010)) in the particular Hilbert spaces of square matrices 
(or operators for functional data) equipped with the Frobenius (or Hilbert-Schmidt) norm. 
The MCM is non negative and unique under weak conditions. As shown in Kraus and 


Panaretos (2012) it also has the same eigenspace as the usual covariance matrix when the 


distribution of the data is symmetric and the second order moment is finite. Being a spatial 
median in a particular Hilbert space of matrices, the MCM is also a robust indicator of 
central location, among the covariance matrices, which has a 50 % breakdown point (see 


Kemperman (1987) or Maronna et al. (2006)) as well as a bounded gross sensitivity error 


(see Cardot et al. (2013)). 

The aim of this work is twofold. It provides efficient recursive estimation algorithms 
of the MCM that are able to deal with large samples of high dimensional data. By this 
recursive property, these algorithms can naturally deal with data that are observed sequen¬ 
tially and provide a natural update of the estimators at each new observation. Another 
advantage compared to classical approaches is that such recursive algorithms will not re¬ 
quire to store all the data. Secondly, this work also aims at highlighting the interest of 
considering the median covariation matrix to perform principal components analysis of 
high dimensional contaminated data. 

Different algorithms can be considered to get effective estimators of the MCM. When 
the dimension of the data is not too high and the sample size is not too large, Weiszfeld’s 


algorithm (see Weiszfeld (1937) and Vardi and Zhang (2000)) can be directly used to 
estimate effectively both the geometric median and the median covariation matrix. When 
both the dimension and the sample size are large this static algorithm which requires 
to store all the data may be inappropriate and ineffective. We show how the algorithm 
developed by Cardot et al. (2013) for the geometric median in Hilbert spaces can be adapted 
to estimate recursively and simultaneously the median as well as the median covariation 
matrix. Then an averaging step (Polyak and Juditsky (1992)) of the two initial recursive 
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estimators of the median and the MCM permits to improve the accuracy of the initial 
stochastic gradient algorithms. A simple modification of the stochastic gradient algorithm 
is proposed in order to ensure that the median covariance estimator is non negative. We 
also explain how the eigenelements of the estimator of the MCM can be updated online 
without being obliged to perform a new spectral decomposition at each new observation. 

The paper is organized as follows. The median covariation matrix as well as the re¬ 
cursive estimators are defined in Section 2. In Section 3, almost sure and quadratic mean 
consistency results are given for variables taking values in general separable Hilbert spaces. 
The proofs, which are based on new induction steps compared to Cardot et al. (2013), allow 
to get better convergence rates in quadratic mean even if this new framework is much more 
complicated because two averaged non linear algorithms are running simultaneously. One 
can also note that the techniques generally employed to deal with two time scale Robbins 


Monro algorithms (see Mokkadem and Pelletierj (2006) for the multivariate case) require 
assumptions on the rest of the Taylor expansion and the finite dimension of the data that 
are too restrictive in our framework. In Section 4, a comparison with some classic robust 
PCA techniques is made on simulated data. The interest of considering the MCM is also 
highlighted on the analysis of individual TV audiences, a large sample of high dimensional 
data which, because of its dimension, can not be analyzed in a reasonable time with clas¬ 
sical robust PCA approaches. The main parts of the proofs are described in Section 5. 
Perspectives for future research are discussed in Section 6. Some technical parts of the 
proofs as well as a description of Weiszfeld’s algorithm in our context are gathered in an 
Appendix. 


2 Population point of view and recursive estimators 

Let H be a separable Hilbert space (for example H = or H = L 2 (/), for some closed 
interval I Cl). We denote by (.,.) its inner product and by ||-|| the associated norm. 

We consider a random variable X that takes values in H and define its center m G H 
as follows: 


m := argminE [\\X — u|| — ||A'| 

u£H 


(1) 


The solution m 6 H is often called the geometric median of X. It is uniquely defined 
under broad assumptions on the distribution of X (see Kemperman ( 1987| ) which can be 
expressed as follows. 

Assumption 1. There exist two linearly independent unit vectors (u\, 112 ) G H 2 , such that 

Var((u, X)) > 0, for u G {«i, ^2}- 


If the distribution of X — m is symmetric around zero and if X admits a first moment 
that is finite then the geometric median is equal to the expectation of X, m = E [V]. 
Note however that the general definition ([IJ does not require to assume that the first order 
moment of ||A|| is finite since |E [||A — u || — ||JT||] | < ||u||. 
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2.1 The (geometric) median covariation matrix (MCM) 

We now consider the special vector space, denoted by S(H), of dxd matrices when H = M rf , 
or for general separable Hilbert spaces H, the vector space of linear operators mapping 
H —>• H. Denoting by {ej,j € J} an orthonormal basis in H, the vector space S(H) 
equipped with the following inner product: 

(A,B) F = Y,(Ae j ,Be j ) ( 2 ) 

j&J 

is also a separable Hilbert space. In 5(K d ), we have equivalently 


(A,B) f = tr ( A t B ), 


(3) 


where A T is the transpose matrix of A. The induced norm is the well known Frobenius 
norm (also called Hilbert-Schmidt norm) and is denoted by ||.|| F . 

When X has finite second order moments, with expectation E [X] = p, the covariance 
matrix of X, E [(X — n)(X — p) T ] can be defined as the minimum argument, over all the 
elements belonging to S(H), of the functional 2 : <S{H) —> M, 


r) 


E 


Ux - p)(x - pf -r\ 


(x - p)(x - pf\\ F 


Note that in general Hilbert spaces with inner product (.,.), operator (A" — p){X — p) T 
should be understood as the operator u £ H t—» (u, X — p)(X — p). The MCM is obtained 
by removing the squares in previous function in order to get a more robust indicator of 
"covariation". For a 6 H, define G a : S(H) —> M by 


G a {V) := E [||(A - a)(X - a) T - H|| f - ||(A - a){X - a) T || F ] . (4) 


The median covariation matrix, denoted by T m , is defined as the minimizer of G m (V) over 
all elements V G S(H). The second term at the right-hand side of Q prevents from having 
to introduce hypotheses on the existence of the moments of X. Introducing the random 
variable Y := (X — m){X — m) T that takes values in S(H), the MCM is unique provided 
that the support of Y is not concentrated on a line and Assumption 1 can be rephrased as 
follows in S(H ), 

Assumption 2. There exist two linearly independent unit vectors (Vi, V 2 ) 6= S{H) 2 , such 
that 

Var«V, Y) F ) > 0, for V G {Hi, H 2 }. 

We can remark that Assumption [l] and Assumption [2] are strongly connected. Indeed, if 
Assumption[l]holds, then Var((it, X)) > 0 for u G {ui, u 2 }- Consider the rank one matrices 
Vi = u\u{ and H 2 = u 2 u^, we have (Vi, Y) F = (u\, X — m) 2 which has a strictly positive 
variance when the distribution of X has no atom. More generally Var((Vi,V)^) > 0 
unless there is a scalar a > 0 such that P [(ui, X — m) = a] = P [(ui, X — m) = —a] = \ 
(assuming also that P [X — m = 0] = 0). 
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Furthermore it can be deduced easily that the MCM, which is a geometric median in 
the particular Hilbert spaces of Hilbert-Schmidt operators, is a robust indicator with a 


50% breakdown point (see Kemperman (1987)) and a bounded sensitive gross error (see 


Cardot et al. (2013)). 


We also assume that 


Assumption 3. 


There is a constant C siLch that for all h £ H and all V G S(H ) 


(a) : E 

\\(X-h)(X-h) T -V\\ F r 

(6): E 

)\(X -h){X -h) T -vf/ 


< C. 

< C. 


This assumption implicitly forces the distribution of (X — h)(X — h) T to have no atoms. 


It is more "likely" to be satisfied when the dimension d of the data is large (see Chaudhuri 


(1992) and 

Cardot et al. 

o 

I— 1 

CO 

) for a discussion). Note that it could be weakened as in 

Cardot et al. 

(2013 

) by allowing points, necessarily different from the MCM T m , to have 


strictly positive masses. Considering the particular case V = 0, Assumption |3ja) implies 
that for all h € H, 


E 




<C, 


(5) 


and this is not restrictive when the dimension d of H is equal or larger than 3. 

Under Assumption^ a), the functional G'/, is twice Frechet differentiable, with gradient 

(X - h)(X - h) T - V 


XG h {V) = -E 

and Hessian operator, X\G(V) : S(H) 

i\n h )-v\\ F 


[\\(X-h)(X-h)T-V\\ F _ 

S(H), 

(Y(h) — V) ® F ( Y{h)-V) 


( 6 ) 


I 


S(H) 


\\ Y (h) - V\\p 


(7) 


where Y{h) = (X — h)(X — h) T . Is(H) is the identity operator on S(H) and A®p B{V) = 
(A, V) f B for any elements A, B and V belonging to S{H). 

Furthermore, T m is also defined as the unique zero of the non linear equation: 


VG m (r m ) — 0. 

Remarking that previous equality can be rewritten as follows, 

(A" — m)( X — m) T 


r — 

L m — 


i 


E 


■=rE 


[\\(X -m)(X -m) T -r m \\ F \ ’ 


( 8 ) 


(9) 


||(X-m)(X-mf r -r m || ir _ 
it is clear that T m is a bounded, symmetric and non negative operator in S(H). 


As stated in Proposition 2 of Kraus and Panaretos (2012), operator T m has an impor¬ 


tant stability property when the distribution of X is symmetric, with finite second moment, 


5 








































i.e E ||A || 2 < oo. Indeed, the covariance operator of A, £ = E [(A — m)(X — m) T ], 

which is well defined in this case, and T m share the same eigenvectors: if ej is an eigen¬ 
vector of E with corresponding eigenvalue A j, then r m ej = Xjej, for some non negative 
value Xj. This important result means that for Gaussian and more generally symmetric 
distribution (with finite second order moments), the covariance operator and the median 


covariation operator have the same eigenspaces. Note that it is also conjectured in Kraus 


and Panaretos (2012) that the order of the eigenfunctions is also the same. 


2.2 Efficient recursive algorithms 

We suppose now that we have i.i.d. copies Ai,..., X n ,... of random variables with the 
same law as A. 

For simplicity, we temporarily suppose that the median m of A is known. We consider 
a sequence of (learning) weights 7 n = Cry/n a , with c 7 > 0 and 1/2 < a < 1 and we define 
the recursive estimation procedure as follows 

(A n+ i - m)(A n+ i - m) T - W n 


Wn+l — I 'V n + 7 n 


W n +1 = w n - 


\\(x n+1 - m)( x n+1 - m) T - W n \\ F 

1 (W n - ffin+l) • 


( 10 ) 

( 11 ) 


n + 1 

This algorithm can be seen as a particular case of the averaged stochastic gradient al¬ 


gorithm studied in Cardot et al. (2013). Indeed, the first recursive algorithm (10) is a 


stochastic gradient algorithm, 

(X n+ i - m){X n+ i - m) T - W n 


E 


I T 

\ J n 


= XG m (W n 


. II (X n +i - m)(A n+ 1 - m) T - W n \\ F ' 

where T n = <t(A],..., X n ) is the <r-algebra generated by Ai,... ,X n whereas the final 
estimator W n is obtained by averaging the past values of the first algorithm. The averaging 


step (see Polyak and Juditsky (1992)), i.e. the computation of the arithmetical mean of the 
past values of a slowly convergent estimator (see Proposition |3.4| below), permits to obtain 
a new and efficient estimator converging at a parametric rate, with the same asymptotic 


variance as the empirical risk minimizer (see Theorem 3.1 below). 


In most of the cases the value of m is unknown so that it also required to estimate the 
median. To build an estimator of r m , it is possible to estimate simultaneously m and r m by 
considering two averaged stochastic gradient algorithms that are running simultaneously. 
For n > 1, 


m n+ 1 =m n + 7 ^ m) -j 


1 


777- n _|_i — Tfl, r 


X n -\-1 

-^n+1 ^n|| 

(m n - m n+ 1 ) 


E1+1 — ^ri T 7 n 


V n +1 — 


n + 1 

(X n+ 1 - m n ){X n+ x - m n ) T - V n 


n + 


11(^71+1 ^7l) (-^71+1 

1 T (Vn - V n+1 ) , 


( 12 ) 

(13) 

(14) 
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where the averaged recursive estimator m n+ i of the median m is controlled by a sequence 
of descent steps 7 ^ m \ The learning rates are generally chosen as follows, 7 = c m n ~ a , 
where the tuning constants satisfy c m E [2, 20] and 1/2 < a < 1. 

Note that by construction, even if V n is non negative, V n +1 may not be a non negative 
matrix when the learning steps do not satisfy 


In 


||(-^ra+l ^n)(-^n+1 TTln)'^' trills 


< 1. 


Projecting V n +i onto the closed convex cone of non negative operators would require to 
compute the eigenvalues of which is time consuming in high dimension even if V n +\ is 


a rank one perturbation to V n (see Cardot and Degras (2015)). We consider the following 


simple approximation to this projection which consists in replacing in (13) the descent step 
7 n by a thresholded one, 


T n,pos — min ('Tn, ||(-^-n+i Tfi n )( K X n ^ r \ 


(15) 


which ensures that 14.+1 remains non negative when V n is non negative. The use of these 


modihed steps and an initialization of the recursive algorithm (13) with a non negative 


matrix (for example Vq = 0) ensure that for all n > 1, V n and V n are non negative. 


2.3 Online estimation of the principal components 

It is also possible to approximate recursively the q eigenvectors (unique up to sign) of 
T m associated to the q largest eigenvalues without being obliged to perform a spectral 
decomposition of P n+ 1 at each new observation. Many recursive strategies can be employed 


(see Cardot and Degras (2015) for a review on various recursive estimation procedures of 
the eigenelements of a covariance matrix). Because of its simplicity and its accuracy, we 
consider the following one: 


Uj } n+1 — Uj,n T 


1 V 


n + 1 


u 


n+1" 


'J,n 


— U 


\u 


J,n 


‘j,n\ 


j = i,---, q 


(16) 


combined with an orthogonalization by deflation of rti, n +l, ■ • ■ ^g,n+i- This recursive al¬ 


gorithm is based on ideas developed by Weng et al. (2003) that are related to the power 


method for extracting eigenvectors. If we assume that the q first eigenvalues Ai > • • • > X q 
are distinct, the estimated eigenvectors u\ tn +i,.. .Uq !n +i, which are uniquely determined 
up to sign change, tend to AiUi,..., A q u q . 

Once the eigenvectors are computed, it is possible to compute the principal components 


as well as indices of outlyingness for each new observation (see Hubert et al. (2008) for a 
review of outliers detection with multivariate approaches). 


2.4 Practical issues, complexity and memory 


The recursive algorithms (13) and (14) require each 0(d 2 ) elementary operations at each 


update. With the additional online estimation given in (16) of the q eigenvectors associated 
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to the q largest eigenvalues, 0(qd 2 ) additional operations are required. The orthogonal- 
ization procedure only requires 0(q 2 d) elementary operations. 

Note that the use of classical Newton-Raphson algorithms for estimating the MCM (see 


Fritz et al. (2012)) can not be envisaged for high dimensional data since the computation 


or the approximation of the Hessian matrix would require 0(d 4 ) elementary operations. 
The well known and fast Weiszfeld’s algorithm requires 0(nd 2 ) elementary operations for 
each sample with size n. However, the estimation cannot be updated automatically if the 
data arrive sequentially. Another drawback compared to the recursive algorithms studied 
in this paper is that all the data must be stored in memory, which is of order 0(nd 2 ) 
elements whereas the recursive technique require an amount of memory of order 0{d 2 ). 

The performances of the recursive algorithms depend on the values of tuning parameters 
Gy, c m and a. The value of parameter a is often chosen to be a = 2/3 or a = 3/4. Previous 


empirical studies (see Cardot et al. (2013) and Cardot et al. (2010)) have shown that, thanks 
to the averaging step, estimator m n performs well and is not too sensitive to the choice of 
c m , provided that the value of c m is not too small. An intuitive explanation could be that 
here the recursive process is in some sense "self-normalized" since the deviations at each 


iteration in (10) have unit norm and finding some universal values for c m is possible. Usual 


values for c m and c 7 are in the interval [2,20]. When n is fixed, this averaged recursive 


algorithm is about 30 times faster than the Weiszfeld’s approach (see Cardot et al. (2013)). 


3 Asymptotic properties 

When m is known, W n can be seen as an averaged stochastic gradient estimator of the 
geometric median in a particular Hilbert space and the asymptotic weak convergence of 


such estimator has been studied in Cardot et al. (2013). They have shown that: 


Theorem 3.1. \Cardot et al. (2013), Theorem 3-4)- 
If assumptions l-3(a) hold, then as n tends to infinity, 

sfr (Wn - T m ) A/"(0, A) 

stands for convergence in distribution and A = (V^ n (T m )) 1 T (V^ l (T m )) 1 


where 

the limiting covariance operator, with T = E 


is 


(y(m)-r m )<g> y (y(m)-r m ) 

||y(m)-r m || F 2 


As explained in 


Cardot et al. 


(2013), the estimator W n is efficient in the sense that it 


has the same asymptotic distribution as the empirical risk minimizer related to G m (V ) (see 


for the derivation of its asymptotic normality in Mottonen et al. (2010) in the multivariate 


case and Chakraborty and Chaudhuri (2014) in a more general functional framework). 


Using the delta method for weak convergence in Hilbert spaces (see Dauxois et al. (1982) 


or 


Cupidon et al. (2007)), one can deduce, from Theorem 3.1 the asymptotic normality of 


the estimated eigenvectors of W n . It can also be proven (see Godichon-Baggioni (2016)) 


under Assumptions 1-3, that there is a positive constant I\ such that for all n > 1, 


E 


W n ~ T r 


K 

< —. 
n 
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Note finally that non asymptotic bounds for the deviation of W n around T m can be derived 


readily with the general results given in Cardot et al. (2016). 


The more realistic case in which m must also be estimated is more complicated because 
V n depends on rh n which is also estimated recursively with the same data. We first state 
the strong consistency of the estimators V n and V n . 


Theorem 3.2. If assumptions l-3(b) hold, we have 


lim \\V n - r m || F = 0 a.s. 

71^-00 

and 


lim \\V n - TJL = 0 a.s. 

n—>oo 11 11 r 

The obtention of the rate convergence of the averaged recursive algorithm relies on a 
fine control of the asymptotic behavior of the Robbins-Monro algorithms, as stated in the 
following proposition. 


Theorem 3.3. If assumptions l-3(h) hold, there is a positive constant C', and for all 
(3 G (a, 2a), there is a positive constant Cp such that for all n > 1, 


E 


E 


\V n -T 


m || f 


<4 

~ n° ’ 


\\v n +i -r, 


mWp 


<r 


The obtention of an upper bound for the rate of convergence at the order four of the 
Robbins-Monro algorithm is crucial in the proofs. Furthermore, the following proposition 
ensures that the exhibited rate in quadratic mean is the optimal one. 


Proposition 3.4. Under assumptions l-3(h), there is a positive constant c! such that for 
all n> 1, 


E 


14 - r 


m 


|| 2 
IIf 



Finally, the following theorem is the most important theoretical result of this work. It 
shows that, in spite of the fact that it only considers the observed data one by one, the 
averaged recursive estimation procedure gives an estimator which has a classical parametric 
y/n rate of convergence in the Hilbert-Schmidt norm. 


Theorem 3.5. Under Assumptions l-3(b), there is a positive constant K' such that for 
all n > 1, 


E 


V n ~T. 


■ I 2 

m\\ F 


< 


K' 


n 


rem 


Assuming the eigenvalues of T m are of multiplicity one, it can be deduced from Theo- 
3.5 and Lemma 4.3 in Bosq (2000|), the convergence in quadratic mean of the eigen¬ 


vectors of V n towards the corresponding (up to sign) eigenvector of T r 


9 





















4 An illustration on simulated and real data 


A small comparison with other classical robust PCA techniques is performed in this section 
considering data in relatively high dimension but samples with moderate sizes. This per¬ 
mits to compare our approach with classical robust PCA techniques, which are generally 
not designed to deal with large samples of high dimensional data. In our comparison, we 
have employed the following well known robust techniques: robust projection pursuit (see 
Croux and Ruiz-Gazen (2005) and Croux et al. (2007|)), minimum covariance determinant 


(MCD, see Rousseeuw and van Driessen (1999)) and spherical PCA (see Locantore et al. 
(1999|)). The computations were made in the R language (R Development Core Team 


(2010^), with the help of packages pcaPP and rrcov. For reproductible research, our 
codes for computing the MCM have been posted on CRAN in the Gmedian package. We 
will denote by MCM(R) the recursive estimator V n defined in (14) and MCM(R+) its non 
negative modification whose learning weights are defined in (15). 

If the size of the data n X d is not too large, an effective way for estimating T m is to 


employ Weiszfeld’s algorithm (see Weiszfeld (1937) and Vardi and Zhang (2000) as well 
the Supplementary file for a description of the algorithms in our particular situation). The 
estimate obtained thanks to Weiszfeld’s algorithm is denoted by MCM(W) in the following. 
Note that other optimization algorithms which may be preferred in small dimension (see 


Fritz et al. p012)) have not been considered here since they would require the computation 


of the Hessian matrix, whose size is d 4 , and this would lead to much slower algorithms. 
Note finally that all these alternative algorithms do not admit a natural updating scheme 
when the data arrive sequentially so that they should be completely ran again at each new 
observation. 


4.1 Simulation protocol 

Independent realizations of a random variable Y £ are drawn, where 

Y = (1 — 0(5))X + 0(5)e, (17) 

is a mixture of two distributions and X, O and e are independent random variables. The 
random vector X has a centered Gaussian distribution in M. d with covariance matrix [£pj = 
min (l,j)/d and can be thought as a discretized version of a Brownian sample path in 
[0,1]. The multivariate contamination comes from e, with different rates of contamination 
controlled by the Bernoulli variable 0(6), independent from X and e, with P(0(<5) = 1) = 5 
and P(0(<5) = 0) = 1 — 6. Three different scenarios (see Figure [I]) are considered for the 
distribution of e: 

• The elements of vector e are d independent realizations of a Student t distribution 
with one degree of freedom. This means that the first moment of Y is not defined 
when 6 > 0. 

• The elements of vector e are d independent realizations of a Student t distribution 
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Figure 1: A sample of n = 20 trajectories when d = 50 and 5 = 0.10 for the three different 
contamination scenarios: Student t with 1 degree of freedom. Student t with 2 degrees of 
freedom and reverse time Brownian motion (from left to right). 


with two degrees of freedom. This means that the second moment of Y is not defined 
when 5 > 0. 


• The vector e is distributed as a "reverse time" Brownian motion. It has a Gaussian 
centered distribution, with covariance matrix [E e ]^j = 2min(d — £,d — j)/d. The 
covariance matrix of Y is (1 — <5)E + SY e . 


For the averaged recursive algorithms, we have considered tuning coefficients c m = c 7 = 
2 and a speed rate of a = 3/4. Note that the values of these tuning parameters have not 
been particularly optimised. We have noted that the simulation results were very stable, 
and did not depend much on the value of c m and c 7 for c m , c 7 € [1, 20]. 

The estimation error of the eigenspaces associated to the largest eigenvalues is evaluated 
by considering the squared Frobenius norm between the associated orthogonal projectors. 
Denoting by P<j the orthogonal projector onto the space generated by the q eigenvectors of 
the covariance matrix £ associated to the q largest eigenvalues and by P y an estimation, 
we consider the following loss criterion, 


i?(Pq,Pq) = tr 



= 2 q — 2tr 


P/P 


<? 


(18) 
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Oracle MOD MCM(W) MCM(R+) MCM(R) SphPCA PP 


Figure 2: Estimation errors (at a logarithmic scale) over 500 Monte Carlo replications, 
for n = 200, d = 50 with no contamination (5 = 0). MCM(W) stands for the estimation 
performed by the Weiszfeld’s algorithm whereas MCM(R) denotes the averaged recursive 


approach and MCM(R+) its non negative modification (see equation 15). 


Note that we always have R( P g , P g ) < 2 q and R(P q , P g ) = 2 q means that the eigenspaces 
generated by the true and the estimated eigenvectors are orthogonal. 


4.2 Comparison with classical robust PCA techniques 

We first compare the performances of the two estimators of the MCM based on the 
Weiszfeld’s algorithm and the recursive algorithms (see @) with more classical robust 
PCA techniques. 

We generated samples of Y with size n = 500 and dimension d £ {50,200}, over 500 
replications. Different levels of contamination are considered : <5 £ {0,0.02,0.05,0.10,0.20}. 
For both dimensions d = 50 and d = 200, the first eigenvalue of the covariance matrix of 
X represents about 81 % of the total variance, and the second one about 9 %. 

The median errors of estimation of the eigenspace generated by the first two eigenvectors 
(q = 2), according to criterion (18), are given in Table[l]for non contaminated data (5 = 0). 
The distribution of the estimation error i?(P 9 ,Pq) is drawn for the different approaches 
in Figure [2] when the dimension is not large (d = 50). As expected, the "Oracle", which 
is the classical PCA in this situation, provides the best estimations of the eigenspaces. 
Then, the MCD and the median covariation matrix, estimated by the Weiszfeld algorithm 
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Oracle MOD MCM(W) MCM(R+) MCM(R) SphPCA PP PCA 


Figure 3: Estimation errors (at a logarithmic scale) over 500 Monte Carlo replications, 
for n = 200, d = 50 and a contamination by a t distribution with 2 degrees of freedom 
with 6 = 0.02. MCM(W) stands for the estimation performed by the Weiszfeld’s algo¬ 
rithm whereas MCM(R) denotes the averaged recursive approach and MCM(R+), its non 
negative modification with learning steps as in (15). 
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PCA MCD MCM(W) MCM(R+) MCM(R) 

d=50 0.0156 0.0199 0.0208 0.0211 0.0243 

d=200 0.0148 - 0.0200 0.0209 0.0246 


SphPCA PP 
0.0287 0.0955 

0.0275 0.0895 


Table 1: Median estimation errors, according to criterion R(P g ,Pg) with a dimension 
q = 2, for non contaminated samples of size n = 200, over 500 Monte Carlo experiments. 

or the modified MCM(R+) recursive estimator, behave well and similarly. Note that when 
the dimension gets larger, the MCD cannot be used anymore and the MCM is the more 
effective robust estimator of the eigenspaces. 

When the data are contaminated, the median errors of estimation of the eigenspace 


generated by the first two eigenvectors (q = 2), according to criterion (18), are given in 
Table [i] In Figure [3J the distribution of the estimation error i2(P g ,P g ) is drawn for the 
different approaches. 

We can make the following remarks. At first note that even when the level of con¬ 
tamination is small (2% and 5%), the performances of classical PCA are strongly affected 
by the presence of outlying values in such (large) dimensions. When d = 50, the MCD 
algorithm and the MCM estimation provide the best estimations of the original two dimen¬ 
sional eigenspace, whereas when d gets larger {d = n = 200), the MCD estimator can not 
be used anymore (by construction) and the MCM estimators, obtained with Weiszfeld’s 
and the non negative recursive algorithm, remain the most accurate. We can also remark 
that the recursive MCM algorithms, which are designed to deal with very large samples, 
performs well even for such moderate sample sizes (see also Figure [3]). The modification 


of the descent step suggested in (15), which corresponds to estimator MCM(R+), permits 
to improve the accuracy the initial MCM estimator, specially when the noise level is not 
small. The performances of the spherical PCA are slightly less accurate whereas the me¬ 
dian error of the robust PP is always the largest among the robust estimators. When, the 
contamination is highly structured temporally and the level of contamination is not small 
(contamination by a reverse time Brownian motion, with 6 = 0.20), the behavior of the 
MCM is different from the other robust estimators and, with our criterion, it can appear 
as less effective. However, one can think that we are in presence of two different popula¬ 
tions with completely different multivariate correlation structure and the MCD completely 
ignores that part of the data, which is not necessarily a better behavior. 

4.3 Online estimation of the principal components 

We now consider an experiment in high dimension, d = 1000, and evaluate the ability 


of the recursive algorithms defined in (16) to estimate recursively the eigenvectors of 


associated to the largest eigenvalues. Note that due to the high dimension of the data and 
limited computation time, we only make comparison of the recursive robust techniques with 
the classical PCA. For this we generate growing samples and compute, for each sample size 
the approximation error of the different (fast) strategies to the true eigenspace generated 


14 







t 1 df 

t 2 df 

inv. B. 

t 1 df 

t 2 df 

inv. B. 

<5 

Method 


d = 50 



d = 200 


2% 

PCA 

3.13 

1.04 

0.698 

3.95 

1.87 

0.731 


PP 

0.086 

0.097 

0.090 

0.085 

0.094 

0.084 


MCD 

0.022 

0.021 

0.021 

- 

- 

- 


Sph. PCA 

0.028 

0.029 

0.027 

0.027 

0.030 

0.028 


MCM (Weiszfeld) 

0.021 

0.021 

0.021 

0.021 

0.022 

0.022 


MCM (R+) 

0.022 

0.022 

0.024 

0.023 

0.023 

0.025 


MCM (R) 

0.026 

0.025 

0.027 

0.026 

0.027 

0.028 

5% 

PCA 

3.82 

1.91 

0.862 

3.96 

1.98 

0.910 


PP 

0.090 

0.103 

0.093 

0.089 

0.098 

0.087 


MCD 

0.022 

0.023 

0.021 

- 

- 

- 


Sph. PCA 

0.029 

0.031 

0.033 

0.029 

0.031 

0.034 


MCM (Weiszfeld) 

0.023 

0.023 

0.028 

0.022 

0.023 

0.030 


MCM (R+) 

0.025 

0.024 

0.035 

0.024 

0.024 

0.039 


MCM (R) 

0.029 

0.027 

0.037 

0.028 

0.028 

0.040 

10% 

PCA 

3.83 

1.96 

1.03 

3.96 

1.99 

1.10 


PP 

0.107 

0.108 

0.099 

0.088 

0.101 

0.097 


MCD 

0.023 

0.022 

0.023 

- 

- 

- 


Sph. PCA 

0.033 

0.033 

0.054 

0.031 

0.033 

0.057 


MCM (Weiszfeld) 

0.025 

0.026 

0.059 

0.023 

0.024 

0.056 


MCM (R+) 

0.030 

0.027 

0.089 

0.027 

0.027 

0.086 


MCM (R) 

0.035 

0.032 

0.088 

0.032 

0.031 

0.086 

20% 

PCA 

3.84 

2.02 

1.19 

3.96 

2.01 

1.25 


PP 

0.110 

0.135 

0.138 

0.091 

0.122 

0.137 


MCD 

0.025 

0.026 

0.026 

- 

- 

- 


Sph. PCA 

0.037 

0.038 

0.140 

0.034 

0.037 

0.150 


MCM (Weiszfeld) 

0.030 

0.030 

0.174 

0.026 

0.028 

0.181 


MCM (R+) 

0.044 

0.036 

0.255 

0.038 

0.032 

0.256 


MCM (R) 

0.050 

0.041 

0.251 

0.042 

0.037 

0.256 

Table 2: Median estimation errors, according to criterion R (P 

g,Pq) with a dimension 


q = 2, for datasets with a sample size n = 200, over 500 Monte Carlo experiments. 
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by the q eigenvectors associated to the q largest eigenvalues of Y m . 

We have drawn in Figure |4j the evolution of the mean (over 100 replications) ap¬ 
proximation error R(P q ,Pq), for a dimension q = 3, as a function of the sample size 
for samples contaminated by a 2 degrees of freedom Student t distribution with a rate 
6 = 0.1. An important fact is that the recursive algorithm which approximates recursively 
the eigenelenrents behaves very well and we can see nearly no difference between the spec¬ 
tral decomposition of V n (denoted by MCM in Figure Q and the estimates produced with 


the sequential algorithm (16) for sample sizes larger than a few hundreds. We can also 


note that the error made by the classical PCA is always very high and does not decrease 
with the sample size. 


4.4 Robust PCA of TV audience 

The last example is a high dimension and large sample case. Individual TV audiences are 
measured, by the French company Mediametrie, every minutes for a panel of n = 5422 


people over a period of 24 hours, d = 1440 (see Cardot et al. (2012) for a more detailed 


presentation of the data). With a classical PCA, the first eigenspace represents 24.4% of 
the total variability, whereas the second one reproduces 13.5% of the total variance, the 
third one 9.64% and the fourth one 6.79%. Thus, more than 54% of the variability of the 
data can be captured in a four dimensional space. Taking account of the large dimension 
of the data, these values indicate a high temporal correlation. 

Because of the large dimension of the data, the Weiszfeld’s algorithm as well as the 
other robust PCA techniques can not be used anymore in a reasonable time with a personal 
computer. The MCM has been computed thanks to the recursive algorithm given in ( |14| ) 
in approximately 3 minutes on a laptop in the R language (without any specific C routine). 

As seen in Figure [5j the first two eigenvectors obtained by a classical PCA and the 
robust PCA based on the MCM are rather different. This is confirmed by the relatively 
large distance between the two corresponding eigenspaces, R( P 2 pCA , CM ) = 0.56. The 
first robust eigenvector puts the stress on the time period comprised between 1000 minutes 
and 1200 minutes whereas the first non robust eigenvector focuses, with a smaller intensity, 
on a larger period of time comprised between 600 and 1200 minutes. The second robust 
eigenvector differentiates between people watching TV during the period between 890 and 
1050 minutes (negative value of the second principal component) and people watching TV 
between minutes 1090 and 1220 (positive value of the second principal component). Rather 
surprisingly, the third and fourth eigenvectors of the non robust and robust covariance 
matrices look quite similar (see Figure [6]). 


5 Proofs 


We give in this Section the proofs of Theorems 3.2, |3.3| and |3.5[ These proofs rely on 
several technical Lemmas whose proofs are given in the Supplementary file. 
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Figure 4: Estimation errors of the eigenspaces (criterion R(P q )) with d = 1000 and q = 
3 for classical PCA, the oracle PCA and the recursive MCM estimator with recursive 
estimation of the eigenelements (MCM-update) and with static estimation (based on the 
spectral decomposition of V n ) of the eigenelements (MCM). 
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Figure 5: TV audience data measured the 6th September 2010, at the minute scale. Com¬ 
parison of the principal components of the classical PCA (black) and robust PC A based on 
the Median Covariation Matrix (red). First eigenvectors on the left, second eigenvectors 
on the right. 
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Figure 6: TV audience data measured the 6th September 2010, at the minute scale. Com¬ 
parison of the principal components of the classical PC A (black) and robust PC A based 
on the MCM (red). Third eigenvectors on the left, fourth eigenvectors on the right. 
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5.1 Proof of Theorem 13.21 

Let us recall the Robbins-Monro algorithm, defined recursively by 


(-^n+l m n) (Xn+i m, n ) V n 


Vn+1 — In + 7 n 

(X n -\- 1 TTl'ii) (-^Gi+1 n^n) V n 
= Vn 7nPn+lj 


with U n+ i := _ ^n +1 m n )(x n+1 rn, n ) v n ' Since T n := a (Xi, X n ), we have E [Pn+il-^n] 

\\(X n+ 1 -m n )(X n+ 1 —m„) -Vn\\ F 

VGfn n (V n ). Thus ^ n +1 := Xm„G(V n ) — U n +i, (£ n ) is a sequence of martingale differences 
adapted to the filtration ( T n ). Indeed, E K+ilJ 7 ™] = V Grn n (V n ) — E \U n+ \\iF n ] = 0. The 
algorithm can be written as follows 


bn+l — V n 7nVG TTln (V)i) +7n£n+l- 


Moreover, it can be considered as a stochastic gradient algorithm because it can be de¬ 
composed as follows: 

V n+ 1 =V n -'Yn (VGW„04) - VGm„(T m )) + 'Jnt.n+l ~ InTn, (19) 

with r n := VGm„(r m ) — VG m (T m ). Finally, linearizing the gradient, 

Iri+l r m — ( Is(H ) r 'in'V m G (T m )) (K T m ) T 7n£n+l 77nGi Tn^nj (20) 
with 


4 := (V^ n G (T m ) - V^G (T m )) (K - T m ), 

Sn := V7Gm n (Vn) - VG^„ (T m ) - V^G (T m ) (V n - r m ) . 

The following lemma gives upper bounds of these remainder terms. Its proof is given 
in the Supplementary file. 

Lemma 5.1. Under assumptions l-3(b), we can bound the three remainder terms. First, 

( 21 ) 


m\\ . (22) 


V n — T m ||p . (23) 


11 $n 11 F 6G 11 V n — V m 11 p ■ 

In the same way, for all n> 1, 

IK|If < 4 (Vc + G^/||r m || F ) IK - 

Finally, for all n > 1, 

\\r' n \\ F < 12 ^Gy^jr^K + C 3/4 ^j ||m n - m|| || 
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We deduce from decomposition (33) that for all n > 1, 

H^n+l — Tm\\ 2 F = \\V n — r m ||> — 27 n (V n — r m , VGrn. n (Vn ) — VGm„(r m )) f 
+ 7n ||VG^ n (y n ) - VG mn (T m )\\ 2 F 

+ 7 n ll£n+l 1 | F + 2 t n (V n ~ ~ 7 n (VGm„ 04 ) — VGm n (r m )) , £n+l)_p 

+ 7n ll r n||^ ~ 27 n (r n , V n — r m ) F — 27^ (r n ,£ n _|_i — VGffi n (V n ) + VGm„(r m ))j? • 

Note that for all /i G iL and 17 7 S(H) we have ||VG^(V)|| F < 1. Furthermore, ||r n || f < 2 
and 11^n+i 11 f < 2. Using the fact that (£ n ) is a sequence of martingale differences adapted 
to the filtration (J r n ), 

® ||Vn+i — r m ||^ \j- n < \\v n — r m ||^, — 27 n (v n — r m , Vfn n G(v n ) — Vm„G(r m ))^ 

+ 28y 2 - 27 n (r n , V n -V m ) F . 

Let a n = n _/3 , with (3 E (1 — a, a), we have 


E 


||Vn+l ~~ r m ||^ < (1 + 7 nCkn) || Vn — V m ||^ — 27 n (U re — f m) Vf n n G (V n ) ~ 'S/ m n G (T m )) F 


(24) 


+ 28 7 ^ + — ||r n ||p . 
Oin 


Moreover, applying Lemma 5.1 and Theorem 5.1 in Godichon-Baggioni (2016), we get 
for all positive constant 5, 


\r n \\ 2 F = O ( || m n - m|| 2 ') = O 1 


,1+5' 


a.s. 


n 


Thus, since 2y n (V n — T m , Vm n G (V n ) - Vm„G (T m )) F > 0, the Robbins-Siegmund Theo¬ 
rem (see Duflo (1997) for instance) ensures that ||V^ — r m ||^ converges almost surely to a 


hnite random variable and 

^ ^ 7n (Vn — r m , Vm„G (14) — V m n G (T m )) p < +OO CL.S. 


n> 1 


Furthermore, by induction, inequality (24) becomes 

E 


||u n+ i - r m ||^ < n(l+7fc a fc) E llVi -r m ||p + 28 JJ (1 + 7fe«fc) Z)7fc 


*,fc=l 
' 00 


\k =1 


fc=l 


+ (H ( 1 + 7 “vS« e 


I 11 — 
rfe||_p 


Since (3 < a, applying Theorem 4.2 in Godichon-Baggioni (2016) and Lemma 6.1, there is 
a positive constant Co such that 


—e 

h ak 


1 M 2 

Ffcll _F 


Co ^ k a 1 13 < + 00 . 


fc=l 
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Thus, there is a positive constant M such that for all n > 1, E 


\v n - r 


771 ||_F 


< M. Since 


m n converges almost surely to m, one can conclude the proof of the almost sure consistency 
of V n with the same arguments as in the proof of Theorem 3.1 in Cardot et al. (2013) and 
the convexity properties given in the Section B of the supplementary file. 

Finally, the almost sure consistency of V n is obtained by a direct application of 


Topelitz’s lemma (see e.g. Lemma 2.2.13 in Duflo (1997)). 


5.2 Proof of Theorem 13.31 

The proof of Theorem |3.3| relies on properties of the p- th moments of V n for all p > 1 
given in the following three Lemmas. These properties enable us, with the application 
of Markov’s inequality, to control the probability of the deviations of the Robbins Monro 
algorithm from T m . 


Lemma 5.2. Under assumptions l-3(b), for all integer p, there is a positive constant M p 
such that for all n > 1, 


E 


Wn - Tr 


||2 p 

II F 


< M p . 


Lemma 5.3. Under assumptions l-3(b), there are positive constants C \, C [, C^, C 3 such 
that for all n > 1, 


E 

11 t/ _p 11 2 

|| v n m || 

< C±e ° inl a + ~ + C*3 SU P IE 

“g 

t— 1 

1 

£ 



^ E(n/2)-\-l<k<n—l 

L J 


where E(x) is the integer part of the real number x. 


Lemma 5.4. Under assumptions l-3(b), for all integer p' > 1, there are a rank n p i and 
positive constants C\y , C^y, c p i such that for all n > n p i, 


E 


\Vn+l -r. 


4 

m\\p 


_ 1 — a 

< ( 1 - c p / 7 n n <■' 


E 


\\Vn - r 


4 

m|| p 




n 


3a 


n 


2a 


I Vn-T r 


+ 


We can now prove Theorem 3.3 


Let us choose an integer p' such that p' > 3/2. Thus, 2 + a — 3^— > 3a, and 
applying Lemma 5.4 there are positive constants Cry,C^y, c p > and a rank n p > such that 
for all n > n p >, 


E 


|fn+l — r 


m\\p 


< 1 - C p '^ n n 


e \\v n -r m f F + % + %Eh|K-r m ||^ 

J n* a n za L 

(25) 

Let us now choose (3 E (a, 2a) and p' such that p' > 2 a-p - that 3 a — (3 > a + 

p i U Tl p i 


One can check that there is a rank n/, > n v > such that for all n > n',, 


(„+1 )“c ie - c ;”-“ + t+c 3 ^ \ n+ \ )f ,- a s 1, 
- . J 

1 Cp' 7 n n 


n + 1A >J , „3q Ci ,p' + 


n 


+ 2 


(n + l ) 3 "-/ 3 


< 1 . 


^3, 


3a—3 
n 
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With the help of a strong induction, we are going to prove the announced results, that 
is to say that there are positive constants Cy, Cp such that 2C p > > Cp > C p / > 1 and 
C p i > 2 ““ i ~ 1 C , 2 (with C *2 defined in Lemma 5.3), such that for all n > 1, 


E 

E 


w -r ii 2 

| v n L m Up 


\V —V I 

\ v n L ml 


Cp>_ 
n a 
Cl 
nP 


< 


< 


First, let us choose Cy and Cp such that 

max 

k<n', 


C p I > max (k a E ||14 
7 -^V L L 

p 


-r 


mWp 


}• 


Cp > max 

k<n', 
— P ' 


k p E 


v n ’ - r m 


Thus, for all k < n pl , 


E 

E 


\\v k -r m \\ 2 F 

\\v k -r m \\ 4 F 


< 


< 


Cy 

k a 

Ci 

kP 


We suppose from now that n > n' p , and that previous inequalities are verified for all 
k < n — 1. Applying Lemma 5.2 and by induction, 


E 


\\v n+ i - r m ||y < + ^ + °3 sup {e 


E((n-\-l)/2)-\-l<k<n 
< C 1 e~ c 'i nl “ + + C 3 sup 

n E((n+l)/2)+l<fc<n 


Cl 

kP 


<c ie - c i" , - + A + C32 />^. 


Cfi 


C / 

Since 2 C p ' > Cp > C p / > 1 and since C p / > 2 “ + 1 C 2 , factorizing by 


E 


II14+1 - r m |||,l < CyC' 1 e- c ^ 1 "“ + + C 3 2P 

J n a 


< 


CL 


(ra + l)“Cie- c i ni “ + 2‘ 


(n + l) c 


2Cp 

nP 

n 


Cp' Cz2P+ l Cp' 


n + lj 2(n + l) a (n + \)P~ a (n + l) c 


<— C i-C' 1 (n + l)“ e - c X-“ + I 1 


(n + l) c 


+ C 3 2 /3+1 - 


y 


2 (n + 1)“ ~ r (n + l)P~ a (n + l) a 

< (o, + iro,,^ + l + «»« 5 A J5 ) 


Cp/ 


(n + 1) ( 


By definition of n',, 


E 


IIK+i-r 


2 

m\\F 


< 


c 


: p' 


(n + 1 ) Q ‘ 

In the same way, applying Lemma |5.4| and by induction, 


(26) 


E 


||i 4 +i-r 


4 

m\\F 


< (1 - Cp'^ n n p' ) E 


it/ _ r ii 4 

\ v n L m\\F 


1 TO 7 Co 


n 3a n 2a 


||i4-r 
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m ||_p 


< 1 - Cp'^nTl p' ) -| + 




n 3 “ n 2 “ n“ 
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Since Cp>C p '> 1, factorizing by fcfp, 

< (l - c p /7„n p' ) ^ + (Ciy + C^y) 

( -^ = T\ (n + \\^Cg 

<(l-c r - 7 „n ' J ^ + - (n + i)3«-f. („+!)„ 


|bn+i r m || F 


n J 

~ P _|_ 23a ^l,p' + ^2,p' C /3 

n n 

< | (l-V7n«-^i ( — Y + 2^ C ^ +C ^ C f> 


By definition of n',, 


E 


\\v n+ i - r m ||p 


n 


< 


(n + l) 3 ""/ 3 / (n + l)/ 3 ' 


CL 


(n + l)/ 3 ’ 


(27) 


which concludes the induction and the proof. 

5.3 Proof of Theorem 13.51 


In order to prove Theorem 3.5 we first recall the following Lemma. 


Lemma 5.5 (Godichon-Baggioni (2016)). Let Yi, ...,Y n be random variables taking values 


in a normed vector space such that for all positive constant q and for all k > 1, E [||Y).|| 9 ] < 
oo. Then, for all real numbers oq,..., a n and for all integer p, we have 


E 




k= 1 


PI 


< X>|(E[||nf]) 

\fc=l 


(28) 


We can now prove Theorem 3.5 Let us rewrite decomposition (34) as follows 


V 2 m G (T m ) (K - T m ) = — - + £ n+1 -r n -r' n - 5 n , 


In 7 n 


(29) 


with T n := V n — T m . As in Pelletier (2000), we sum these equalities, apply Abel’s transform 
and divide by n to get 


v^g (r m ) (v n -r m ) = 


1 / T\ _ T n+ 1 
n \ 7i 7n+i 


E T * 

k=2 


1 1 

7 k 7fe-i 


J2 6k ~Yl rk ~H r ' k + Y. &+i 


We now bound the quadratic mean of each term at the rig) 

2 1 

= o (-). Applying Theorem 


First, we have -^E 

’ n z 


k=1 k=1 k= 1 /c=l 

it-h and side of previous equality. 

3.3 


n z 


T n +1 

2 " 

In 
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/ -2 


n 2 n _a 


= o | - 
n 


Moreover, since 1 — 7 fc 2i| £ 2ac 1 1 k a 3 , the application of Lemma 


5.5 


and Theorem 


3.3 
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since a < 1. In the same way, since ||<5 n || F < 6 C ||T n || F , applying Lemma 
rem 
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Since ||r^|| F < C 0 \\m n - m\\ \\V n - T m \\ 2 F with C 0 := 12 (C^/\\T m \\ F + C 3 / 4 ), Cauchy- 
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Schwarz’s inequality and Lemma |5.5| give 
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Applying Theorem 4.2 in Godichon-Baggioni (2016) and Theorem 3.3, 
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since /3 > 0. Finally, one can easily check that E ||K+i|If < 1, and since (£ n ) is a 
sequence of martingale differences adapted to the filtration (J- n ), 


4 e 

n z 


E&+i 


fc=i 


n“ 


n“ 


E E H^+illc + 2 E E E KCfc+l,^fc'+l)ir] 

\k= 1 /c=l fc'=fc+l / 

^ n nn 

E E ii^+iIIf + 2 E E e (^+i> e ^'+! 

v/c=l /c=l k'=k-\- 1 


-K' 


FJ 


n 

1 

< -. 
n 


-| n 

? E E [n&+i| 


fc=i 


Thus, there is a positive constant K such that for all n > 1, 


E 


|V^G(T m ) (v„ - r m ) I 


< 


K 


n 


Let A m i n be the smallest eigenvalue of V^G (r m ). We have, with Proposition B.l in the 
supplementary file, that A m ; n > 0 and the announced result is proven, 

K 


E 


V —F 

| v n L m 11 p 


< 


A min n ' 


6 Concluding remarks 

The simulation study and the illustration on real data indicate that performing robust 
principal components analysis via the median covariation matrix, which can bring new 
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information compared to classical PCA, is an interesting alternative to more classical robust 
principal components analysis techniques. The use of recursive algorithms permits to 
perform robust PCA on very large datasets, in which outlying observations may be hard 
to detect. Another interest of the use of such sequential algorithms is that estimation of the 
median covariation matrix as well as the principal components can be performed online 
with automatic update at each new observation and without being obliged to store all 
the data in memory. A simple modification of the averaged stochastic gradient algorithm 
is proposed that ensures non negativeness of the estimated covariation matrices. This 
modified algorithms has better performances on our simulated data. 

A deeper study of the asymptotic behaviour of the recursive algorithms would certainly 
deserve further investigations. Proving the asymptotic normality and obtaining the limiting 
variance of the sequence of estimators V n when m is unknown would be of great interest. 
This is a challenging issue that is beyond the scope of the paper and would require to 
study the joint weak convergence of the two simultaneous recursive averaged estimators of 
m and T m . 

The use of the MCM could be interesting to robustify the estimation in many different 
statistical models, particularly with functional data. For example, it could be employed 
as an alternative to robust functional projection pursuit in robust functional time series 
prediction or for robust estimation in functional linear regression, with the introduction of 
the median cross-covariation matrix. 

Acknowledgements. We thank the company Mediametrie for allowing us to illustrate 
our methodologies with their data. We also thank Dr. Peggy Cenac for a careful reading 
of the proofs. 

A Estimating the median covariation matrix with Weiszfeld’s 
algorithm 

Suppose we have a fixed size sample X\,... ,X n and we want to estimate the geometric 
median. 

The iterative Weiszfeld’s algorithm relies on the fact that the solution m* of the fol¬ 
lowing optimization problem 

n 

minV \\Xi — /i|| 

MS H ^ 

1=1 

satisfies, when m* 7 ^ Xi, for all i = 1 ,..., n 

n 

rn* n = Wi On) x i 
1=1 

where the weights Wi(x ) are defined by 



Weiszfeld’s algorithm is based on the following iterative scheme. Consider first a pilot 
estimator m of m. At step (e), a new approximation fhn + ^ to m is given by 


m. 


(e+l) - 


n 

E^H e) ) Xi - 


(30) 


2—1 


The iterative procedure is stopped when 
in advance. The final value of the algorit 


-(e+l) -(e) 

mi — m n 


< e, for some precision e known 
ini is denoted by fh n . 


The estimator of the MCM is computed similarly. Suppose C 6 ) has been calculated at 
step (e), then at step (e + 1), the new approximation T( e+1 ) to T m is defined by 


n 

ri e+1) = E w * ( ?(e) ) ^ ~ ^)CQ ~ ™*) T - 


(31) 


2—1 


The procedure is stopped when 


f (e+l) _ p(e) 


< e, for some precision e hxed in advance. 


Note that by construction, this algorithm leads to an estimated median covariation 
matrix that is always non negative. 


B Convexity results 

In this section, we first give and recall some convexity properties of functional Gh • The 
following one gives some information on the spectrum of the Hessian of G. 

Proposition B.l. Under assumptions 1-3(b), for all h £ H and V £ S(H), S(H) admits 
an orthonormal basis composed of eigenvectors o/V|G(V’). Let us denote by {A h,v,iU £ N} 
the set of eigenvalues of \/j i G(V). For all i £ N, 

0 < A h,v,i < C. 

Moreover, there is a positive constant c m such that for all i £ N, 

0 c m A c. 

Finally, by continuity, there are positive constants e, e' such that for all h £ B (m, e) and 
V £ B (T m , e'), and for all i £ N, 


T A/j y j T C. 


The proof is very similar to the one in Cardot et al. 

to 

o 

I — 1 

CO 

) and consequently it is not 

given here. Furthermore, as in Cardot et al. 

(2016 

), it ensures the local strong convexity 


as shown in the following corollary. 


Corollary B.2. Under assumptions l-3(b), for all positive constant A, there is a positive 
constant ca such that for all V £ B (T m , A) and h £ B (m, e), 

(VhG(V) - V h G(T m ), V - V m ) H >c A \\V- F m f F . 
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Finally, the following lemma gives an upper bound on the remainder term in the Taylor’s 
expansion of the gradient. 


Lemma B.3. Under assumptions 1-3(b), for all h E H and V E S(H), 

||vGti(f) - VG h (r m ) - (r m ) (v - r m )|| F < ec \\v - r m \\ 2 F . 


(32) 


Proof of Lemma |iO[ Let 5 v ,h ■= VG/ l (F) - VG h (T m ) - V 2 h G (r m ) (V - T m ), since 
VG h (V) - VG h (r m ) = fj V 2 h G (T m +1 (V - r m )) (V - r m ) dt, we have 

ri 


\\fiv,h ||p = 


v{g (r m + t(v- r m )) ((v - r m ) dt - v 2 h G (r m ) (v - r m ) 


< / ||v|g (r m + t(v- r m )) ((v - r m ) - v 2 h G (r m ) (v - r m )\\ dt. 

Jo 


As in the proof of Lemma 5.1 in Cardot et al. (2016), under assumptions l-3(b), one can 
check that for all h € H , and t E [0,1], 

pic (r m +1 (v - r m )) ((v - r m ) - v 2 h c (r m ) (v - r m )|| F <6 c\\v- r m \\ 2 F , 

which concludes the proof. □ 


C Decompositions of the Robbins-Monro algorithm and proof 
of Lemma 5.1 


Let us recall that the Robbins-Monro algorithm is defined recursively by 


14- 1-1 — 14 + 'In 


(Xn+1 m n) (Xn +1 Vi 


(Xn+1 TW'n) (A4+1 ) Vi 


— Vi r )nUn+li 


with Un +1 ■■= - ||A Xn+1 - n ,l { ^ n+1 — "r Jf n \\ ■ Let us remark that f n +i := V ? 

\\{X n+ 1 ~mn){X n+ i-m n ) Vn || p 


MVn) - 


U n + 1 , (£ n ) is a sequence of martingale differences adapted to the filtration {J- n ) and the 
algorithm can be written as follows 


14+1 = Vn - 7n (VG S „ (V n ) - (T m )) + 7n^ n +l ~ TnAi, (33) 

with r n := VGm n (T m ) — \/G m (T m ). Finally, let is consider the following linearization of 
the gradient, 

14+1 r m — (yJ-S(H) (I'm)) (14 r m ) T 'Ynfn+l In^n 7Vi4i 'In&ni (34) 

with 


r'n ■■= (v^g (r ro ) - v 2 m G (r m )) (14 - r m ), 

5n ■= ( 14 ) - vg„„ (r m ) - (r m ) (K - r m ). 
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Proof of Lemma 5.1. The bound of ||<5 n || is a corollary of Lemma B.3 
Bounding ||r n || 

Let us recall that for all h E H, Y(h ) := (X — h) (X — h) T . We now define for all 
h E H the random function : [0,1] —> S(H) defined for all t E [0,1] by 

Y(m + th ) - r m 


<Ph(t) ■■ = 


II Y(m + th) - T m || F ' 


Note that r„ = E 


Pm n —m( 0) (p rnn — m(l) 


X n 


. Thus, by dominated convergence, 


\\r n \\ F < sup E \\^ mn _ m (t)\ 


T 

J n 


iS [0,1] 

Moreover, one can check that for all h E H. 

h (X — m — th) T (A" — rn — th) h T 


•Phi*) = ~ 


|| Y(m + th) - T m || F \\Y(m + th) - T m || F 
+ (Y (m + th) -r m ,h (X -m- thf) Y ( m + th )- r ™ 

\ 1 j j ' F ||T(m + th) — T m || F 

+ (Y(m + th) - T m , (X — m — th) h T ) p ,, 1(m ^3 • 

||k (jn T th) T m \\ F 

We now bound each term on the right-hand side of previous equality. First, applying 
Cauchy-Schwarz’s inequality and using the fact that for all h, h! E H. ||h/i ,T || F = ||/i|| ||h/||. 


E 


h (.X — m — th) J 


||Y (m + th) - T m || F 


< \\h\\ E 


< ll/ill E 


|| AT — m — th\\ 

.11 Y(m + th) - T to || f 

\/\\Y(m + th)\\ F 


<11/111 E 


|| Y(m + th) -T m || F 

_ VW^mW F _ 

|| Y (m + th) - r m || F 


+ E 


\/ll Y{m + th) - T m || F 


Thus, since E 


\\Y{rn+th)-rm\\ F 


<C, 


E 


h (X — m — th) 1 


In the same way, 


E 


||Y (m + th) - F m || F 


11 (X — m — th) h T \\ p 
|| Y(m + th) - T m || F 


< 


(c^\\r m \\ F + Vc^ 


(35) 


< 


C\ \\T 


m\\F 


+ Vc\ 


(36) 
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Applying Cauchy-Schwarz’s inequality, 


E 


(y( m + th) — r m , h(X — m — th) T 


F 


|| Y(m + th) - Tmjjj. 

|| Y(m + th)~r m f F 


h(X — m — th) T 

__ F 

|| Y(m + th) - r m || F 


<IN|E 

<IN|E 


|| A — m — th || 

|| Y(m + th) - r m || F 

^\\Y(m + th)\\ F 
|| Y(m + th) -r m || F 


Thus, since E 


\\Y(m+th)-Tr, 


< (7, and since for all positive constants a, b , \Ja + b < yfa + y/b, 


II h\\E 


y/\\Y(m + th)\\ F 

< M 

i 

> 

1 _ 

+ E 

1 

1 

u 

1 

Cc" 

-fo 

+ 

_1 

II Y(m + th) - r m || F 

yJ\\Y(m + th) - r m || F 

) 


< m 



Finally, 


E 


(Y[m + th) — T m , h (X — m — th) T 


F 


|| Y(m + th) - r m || F 
|| Y(m + th) - r m || F 


E 


\(Y(m + th) 


r m , (X -m- th) h T ) F 


|| Y(m + th) -r m || F 

\\Y(m + th)-r m f F 



Applying inequalities (35) to (38) with h = m n — m , the announced result is proven, 


n\\F — 


< 4 y/C + CJ\\Y. 


m 11 F I 11 11 L n 


m r — m\\ 


Bounding ||r^|| 

For all h G H and V G <S(H), we define the random function <fhy '■ [0,1] —► S(H) 
such that for all t G [0,1], 


Vh,v{t) 


1 

|| Y{m + th) - r m || F 



(Y(m + th) — r m ) <8) F (Y(m + th) 
|| A (m + th) - r m || F 



(Y). 


Note that r' = E 


(Pmn—m,v„— r m (1) l Pm n — m,V„— r m (0) 


Tr 


By dominated convergence, 


| F < sup 
te [o,i] 


E 


| ^Prrin—m,V n — r m ify || 


T 

J n 


Moreover, as for the bound of ||r n ||, one can check, with an application of Cauchy-Schwarz’s 
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inequality, that for all h E H, V E S(H), and t E [0,1] 

|| Y(m + th ) - r m || F ||h T (X - m - th)\\ p 


Vh,v(t) < 6 
+ 6 


\\Y(m + th) -T m f F 
||y(m + th) — r m || F || h(X — m — th ) T | 


< 12 


\\Y{m + th)-T m f F 
||/j(X — m — th ) T || „ 

L, r f \w\\ F . 

||1 (m + th) - r m || F 


|| (Y(m + th) - T m ) ® F (Y(m + th) - T m ) (U)|| F 


Finally, 


E 


\\h(X — m — th) T \\ „ 

II v > 11 + ,| t/ . 

< F 

1 

7 

15“ 

-to 

1 

1 

15 “ 

1_ 

\W( , + u\ u ||2 II 1 1! F 

1 (m + th) - T m F 


1 

+ 

c-+- 

1 

•n 

1 _ 


<\\h\\ \\V\\ F E 

+ INIII^IfE 

< ( 


v/F 


mWp 


|| Y(m + th) - T m \\ F 

1 


_\\Y(m + th) - r m || F /2 

m UF + C 3/4 


F ■ 


(39) 


Then the announced result follows from an application of inequality (39) with h = m n — m 
and V = V n - T m , 


K\\ < 12 


||m n - m|| \\V n - r 


m\\]? • 


□ 


D Proofs of Lemma 5.2, 5.3 and 5.4 


Proof of Lemma 5.2. Using decomposition (33) 


1114+1 — r m || F — \\v n - r m || F — 2 'fri (14 — r m , xG mn (14) — VG mn (r m )) F 


+ 7nl|VG^ n (U n )-VG^ n (r m )||^ 

+ 7 n II4i+i|If + 27 n (14 ~ T m — 7 n (VGm„(14) ~ VGm n (r m )) , £n+l )p 

T 7n || r n || F — 2 r y n (r n , V n T m ) F — 2y n ( r n , 4i+i — V Gm n (Vn) T VGm n (f m ))p . 


Note that for all h E H and V E S(H) we have ||VG/ l (U)|| F < 1. Moreover, ||r n || F < 2 and 
||£ n +i|| F 4 2. Since for all h E H, Gh is a convex function, we get with Cauchy-Schwarz’s 
inequality, 


1114+1- r m || F a < || 14 - L m 11 F + 367,^ + 2^ n (^n+i, 14 — r m ) F 


< 2‘ n tn (T'nt V n r m ) F . 

(40) 


Let C' := 4 + G^/||r m || F j , let us recall that ||r n || F < C' \\m n — m||. We now prove 

by induction that for all integer p > 1, there is a positive constant M p such that for all 


n > 1, E 


114 — r 


1 2 p 

mWp 


< M p . 
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The case p = 1 has been studied in the proof of Theorem 3.2. Let p > 2 and suppose 
from now that for all k < p — 1, there is a positive constant such that for all n > 1, 


E 


IK-r 


1 2k 
m \\f 


< M k . 


w — r ii 2p 

| v n x ra||p 


Bounding E 

Let us apply inequality (40), for all p > 2 and use the fact that (£ n ) is a sequence of 


martingales differences adapted to the filtration 


E 

p 


[l|K+i 


-T 


12 P 

771 ||p 


< E 


K — r m || F + 36y 2 + 2y n 11 r n 11 ^ \\V n — T m 11 ^ 


k=2 


(^)lE ( 27 n (V n — T m ,£ n +i)(j|K — r m ||^ + 36y 2 + 2^ ||r n ||^ \\V n — r m || F ^ 


p—k 


(41) 


Let us denote by (*) the second term on the right-hand side of inequality (41). Applying 
Cauchy-Schwarz’s inequality and since H^n+illp < 2, 


(*) = E k) e ( Vn ~ r m,Cn+ i)) k (||K ~ r m ||^ + 36y 2 + 27 n ||r n || F \\V n - r m || F ) 

k=2 ' ' 

p / \ r 

< X] u) 22fc 7n'E IIK - r m \\ k F (llK - r m ||* + 36 7 2 + 2 ln ||r n || f ||K - r m || F ) P 


p—k 


k =2 


With the help of Lemma E.l 

p 

(*) < Y 2 2fc 3 p - fc " 1 7n E IIK - r m ||^" fc + Y 2 2fc 3 p - fc - 1 36 p - fc 7 2p " fc E ||K - T 


k 

m\lF 


k=2 
P 


k=2 


+ Y 2 p+fc 3 p_fc_1 7 ^E ||r n ||p fc ||K - r m ||5- 


k=2 


Applying Cauchy-Schwarz’s inequality, 


Y 2 2fc 3 p_fc_1 7 ^E 


k=2 


IK-r 


12 p—k 


n x ra||p 


Y 2 2fc 3 p_fc_1 7n IE 


k=2 

P 


IK-r. 


IP- 1 


n m\\F \\ y n ^ m Up 


K - r„ 


hp+i— fc 


< J^2 2fc 3 p “ fc “ 1 7n\/ IE 


fc =2 


|t/ p ||2(p 1) 

| 1 m||p 


|t/ p ||2(p+l—A;) 

| v n 1 m ||p 


E 


By induction, 

p 


Y 2 2k 3 p ~ k ~ 1 'y k K 


IK-r, 


i2p— 


n x m||p 


< 


E 2 2fc 3 p - fe - 1 7^y^x/i)V^ 


k=2 


k=2 


= 0( 7 2 ). 


(42) 
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In the same way, applying Cauchy-Schwarz’s inequality and by induction, 


j- 2 2fc 3 p-fc-i 36 p-fc 7 2p-fc E _ p m ||fcl = 2 2fc 3 p-fc-i 36 p-fc 7 2 P -fc E \\\v n - r m || F || 14 - r 


k -1 
m\\p 


k =2 


k =2 
P 


< ^ 2 2fc 3 p - fc - 1 36 p - fc 7 ^~ k 

k =2 

= O ( 7 *) , (43) 

since p > 2. Similarly, since ||r n ||^ < 2 and since p > 2, applying Cauchy-Schwarz’s 
inequality and by induction, 


E 2 p+fc 3 p - fc ” 1 7 PE [||r n ||?r fc UK - r m || 5 -] < J 2 2 2p 3 p_fc_1 7 p E [||K - r m || p ] 

k =2 

< ^2 2p 3 p - fc “ 1 7 P v / ^V^V7 


k=2 


k =2 

= O 62) 


(44) 


Finally, applying inequalities (42) to (44), there is a positive constant such that for all 
n > 1, 


E 


V / \ 

E k J ( 2 7n (K - r m ,^+i) F ) fc (||k - r m ||^ + 36 7 2 + 2 7n Urnll^ ||k - r m || F ) 


Lfc=2 


< 


' -,2 
n* 


(45) 

We now denote by (**) the first term at the right-hand side of inequality (41). With the 
help of Lemma E.l and applying Cauchy-Schwarz’s inequality, 

p 


(**)<E ||K-r m ||^ + E ( HE [(36 7 ^ + 2 ln (r n , V n - T m ) F ) k \\V n - T 


2p—2k 
m|| p 


< E 


k= 1 

P 


k= 1 


||K-r m ||^ +E(y 2fc ” lE (36 fe 7 f + 2 fe 7 nlkn||^||K-r m ||^) ||K-r m ||^ 


\2p— 2k 


Moreover, let 
p 


(***) :=E (r) 2 "' 1 ® (36 fc 7 f + 2 k 1 k n \\r n \\ k F \\V n -T m \\ k F ) ||K-r m |$ 


k =1 
P 




E l P \ c\k — 1 o a.k 2/ctt7 IIt/' v 1 II ^P—2/c 1 \ ' t P \ ^2k —11^. II ^ 11T/ - f II ^P—^ 
I > 1 2 36 7n E ||K-r m ||/ + zL 1 /,) 2 7n E Fnllp ||K - r m ||/ 


1 2p—2k 


k =1 

By induction, 

p 




4^ 1 r)2/c—1 fcjl 


fc=l 




fc=l 


E ( ^j 2 fc_1 36 fc 7 2fc E ||K - F 


\2p— 2k 


mWpp 


E l P \ c\k — lock-,2k i\/f 

( 7 J 2 36 7 n M p - k 


k= 1 




= O (7n) ■ 
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Moreover. 


E 

k =1 


P\o2k-l„,k 


7^ 


r || fc IIt/ _ r ||2p- fc - v ( P \2 2k - 1 ^ k ¥ Hr ll fc \\V — V ll 2? 
• n\\p II y n L m\\jr — / J I , I z in^ W' n\\ p \\ v n 1 m \ \ j? 


12 p—k 


k =2 


+ 2 p 7 „E ||r n || F ||K - r m ||/ 


2p—l 


Applying Cauchy-Schwarz’s inequality and by induction, since ||r n ||^ < 2 
v 

\k ii t/- -n ii 2p—k 


E 

k=2 


P \ q2/c— 1^,/c 


7*® 


| r ||* ||v _ r |Hf 
I' n ||p || v ri m||p 


P 

sE 

k=2 

P 

sE 

k=2 


p ) 2 3fc - 1 7% 


||Ty _r II ? 

|| ’'n 1 m 


2p—k 


P \ r\3k —1 /c 


In '\/Mp+ 1 —k \/Mp— i 


= O (7n) • 

Moreover, applying Theorem 4.2 in Godichon-Baggioni (j2016) and Holder’s inequality, 
since ||r n || F < C' \\m n — m\\, 


2pq n E 


m n — m 


\rn\\ F \\Vn-T m \\ 2 F 1 < 2C'p7nIE 
< 2C'p7n ( E 

J 

/f 2 *' / r 

2C +»77( E [ lie 


— P || 2 P — 1 

n x m ||p 


\m n — m 


|2p 


2p 


E 


p4-r m ||^ 


2p—1 
2p 


< 


_ p ||2p 
n x m 11 p 


2p —i 
2p 


Finally, 


2C ' p ^^f /2 ( E [ll^ - r ™llp]) < 2C 'p ^^2 max {i ,e [||T 4 - r m ||£]} 


2p— 1 7>" 2 P 

2r> . „ -L*-T) 

' n 

K, 
n 

Thus, there are positive constants Aq, A" such that 


< 


2 G P7n 1/2 (^E H'l'ra — r m ||p 


2 P 


(**) - ( X + A 0 n a+ 1/2 ) E JI Fn ~ + A l n a+l/2 


(46) 


Finally, thanks to inequalities (45) and (46), there are positive constants A ' {) , A\ such 


that 


E 


\\v n+1 - r 


2p 
m Up? 


< 


i + 4 


l 


n 


° n «+V 2 


E 


||K-r m ||^l +a;- 

J n 1 


1 


.0+1/2 


< 


1 


< 


n ( i+ ^o^+i/a 

°o , 

n [ i+A 'o^Tj 2 


k =1 


< M p , 


E 


E 


n n / \ 

ll^i -Tmllp] +5Z II f 1 + ^0^717^ J 


/c=l J=/c+l 
oo 


1 


P'i - r m ||p +n(i + ^ 
j=i 


Jja+l/2 / 1 £a+l/2 

1 \ “ 1 


E4 


ja+1/2 / Z_/ r.a+1/2 

7 7 fc=l 


which concludes the induction and the proof. 


□ 
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Proof of Lemma 5.3. Let us define the following linear operators: 


a n ^S(H) 7 nV m G(r m ), 
n n 

Pn := n** = n ( T S(H) - 7 fcV^G(r m )) , 

k= 1 /c=l 

A) := 


Using decomposition (34) and by induction, for all n > 1, 


Vn — fm — Pn-i (Vi — T m ) + /3 n _i M n — (3 n _iR n — /3 n _i— /3 n _iA n , (47) 


with 


n—1 


n—1 


-^n : — 'y ^ IkP^ £fc+lj 

fc=l 

n—1 

< : = ^7*/^% 


Rn ■ — /* ^ 7 

fc=l 

n—1 

^n := ^ ^ 7 kPk $k- 


k= 1 


fc=l 


We now study the asymptotic behavior of the linear operators P n and /3 n _i/L 1 . As in 


Cardot et al. (2013), one can check that there are positive constants cq, ci such that for all 


integers k,n > 1 with k < n — 1, 


ill < CnP A minEfc=l7n 

1-1 Wop -i L 0 e ! 


lAi-lVIlop ^ C l e 


Amin Y2j=k T? 


(48) 


where ||.|| op is the usual spectral norm for linear operators. We now bound the quadratic 


mean of each term in decomposition (47). 


Step 1: the quasi deterministic term / 3 n _i(Vi — r m ). 


Applying inequality (48), there is a positive constant Cq such that 


E 


■»-! (Vi-r m )||j, 


< WPn-lKpV 


\\Vi-r m \\ 2 F 


< c 0 e 2A minEfc=i7n]g ||Vi — r 

\\Vi-rji 2 F 


2 

m|| f 


< c 0 e~ c o nl “E 


(49) 


This term converges exponentially fast to 0. 

Step 2: the martingale term P n -iM n . 

Since (£ n ) is a sequence of martingale differences adapted to the filtration (J- n ), 

n—1 n—1 n—1 


E 


n—1 Afyj || p 


y 1 jj| Ai— lPf- 7fc^fc+l| 


k =1 
n—1 


4~ 2 y ^ y ' 7fe7fc'E [(/?n— lPk ^k+hPn— iPk' &'+l)p] 
k =1 k'=k +1 
n—1 n—1 

WPn-lP^lkfk+l^p + 2 ^ 7fc7fc' E [{Pn-lP^P.k+liPn-lPk^lf.k'+llFk'])! 

k= 1 k =1 k'=k-\- 1 

n—1 

^E [H^n-l/^feSfeCfc+lll 


k=l 
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Moreover, as in Cardot et al. (j2016), Lemma E.2 ensures that there is a positive constant 

(50) 


C[ such that for all n > 1, 


E 


'n— 1 M n 11 p 


<A 

n a 


Step 3: the first remainder term (3 n -iR n . 

Remarking that ||r ra || F < 4 (y/C + C'^/||r m || F ^ || m n — m\\, 





(n-l \ 2 

E 

WPn-lRnWp 

< E 

Zl7fe||/3n- 1 /3fe 1 || op |kfc|| F 




\fc=l / 


< 16 VC+ J\\T. 


m\\p 


E 


/ n— 1 




— m\ 


vfc= 1 


Applying Lemma 4.3 and Theorem 4.2 in Godichon-Baggioni (2016) 


E 


'n—lRnW p 


2 / ra —1 


< 16( VC + C^/||r m || F J ^7 l ||/3„-i4- 1 || op y'E 

< 16 ( v'c + chlirjj.) kAy^ 7*||A,-.+|| 7 




rrik — m\ 

v 

op £1/2 I 


Applying inequality (48), 


/n— 1 


E 


1 D p 
n—l-^n || p 


< 16 (VC + Cy/I\i ^2 7fc e 


E n 

j=k T? . 


^/c=l 


Ad / 2 


< 16 (\/C + C K\ J2 7fce" E ?= fc 7j 

Vfe=i 


A: 1 / 2 


Splitting the sum into two parts and applying Lemma E.2, we have 


E 


’ P II * 
n— l-^n || p 




<32( v / C + G v /||r m || F ) K x [ ]T 7fc e 


/ E(n/2) 


E n 

j=k T? . 


k =1 


fcV2 


+ 32 K/c + c\/||r. 


m \\p 


K\ ( 22 7fce E J= fc73 ' 

i /c=.E(n/2)H-l 


fcV2 


= 0(1 

. n 


Thus, there is a positive constant C 2 such that for all n > 1, 


E 


’ P II ^ 
n— 1 1 Va 11 p 


Co 

< -Z. 

n 


(51) 


Step 4: the second remainder term j3 n -iR! n . 

Let us recall that for all n > 1, ||r(J|^ < 12D \\rn n — m\\\\V n — V m \\ F with D : = 
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C^WA~F + <? 3/4 ' Thus, 





' (n-l \ 2 ‘ 
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|| Pn-lKWp 

< E 

E^IK-^IIJKIIf 




\fc=l / 


< 144D 2 E 


( n— 1 


E"^ ||/3n— 1 ||o P “ m ll ll^fc - r m || F 


\k =1 


Applying Lemma 4.3 in Godichon-Baggioni (2016), 


( n —1 


E 


\Pn-iR'n\\ F <144L» 2 ^ 7fc ||/3 n _ 1 4- 1 || op WE ||mfe — m|| 2 ||T4 — T 


vfc=l 


12 

m|| jr 


y _ r II 

v n L m || p 


Thanks to Lemma 5.2, there is a positive constant M 2 such that for all n > 1, E 
Thus, applying Cauchy-Schwarz’s inequality and Theorem 4.2 in Godichon-Baggioni (2016) 


(n— 1 


E 


| rrik — m\ 


\Pn-lK\\ F \ < 144 ^ E™ WPn-lPklop ( E 

Vfe=i 

< 144D' 2 yM2& uE 


E 


\\v k -r m \\ 4 F 


As in step 3, splitting the sum into two parts, one can check that there is a positive constant 
C'{ such that for all n > 1, 


E 


\Pn-lK 


"n 11 p 


< 


C‘ 


n 


(52) 


Step 5: the third remainder term: /3 n _ \A n 

Since ||<5 n || F < 6C ||V^ — T m ||^, applying Lemma 4.3 in Godichon-Baggioni (2016), 





’(n-l \ 2 ' 
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||/?n—1 || p 

< E 

( E^ 1 \Pn~lPk 1 op F ) 




\k= 1 ) 


< 36C 2 E 


( n—1 


E^lh-i 7 _1 IUllU-r 
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m\\p 


\k= 1 


/n—1 


<36C 2 ^ 7 fc||^-i/3 fc - 1 || op JE ||Vfc-r. 


\fc =1 


I 4 

m 11 p 


Thanks to Lemma 5.2, there is a positive constant M 2 such that for all n > 1, E 
Thus, splitting the sum into two parts and applying inequalities (48) and Lemma E.2 there 


11 / — r 11 

| v n L m || p 


< M 2 . 


< M 2 . 
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are positive constant Cq, C 2 such that for all n > 1 , 
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<72C 2 M 2 2 I ^ 7 k e~^= klj 
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1 1 L 
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m\\F 


}( x 

k=E(n/ 2)+l 
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m\\F 


} +o ( 


e -2c^n 1 -“ 


Thus, there is a positive constant Cq such that for all n > 1, 


E 


'n—1 An || p 


< C' 0 e 2c ° n a + C' 2 sup 

E(n/2)-\-l<k<n— 


|e [||Vfe — r m ||^] |. 


(53) 


Conclusion: 


Applying Lemma E.l and decomposition (47) 

, for all 

n > 1 , 



E 

lie — r 11 2 

|| v n L m\\F 

< 5E 

||/3 n _i(Vi -T m )||^ 

+ 5E 

1 1 Pn— 1 M n 11 p 

+ 5E 

11 fin— 1 Rn 11 f 



+ 5E 

)\Pn-lRn\\ 2 p\ +5E 

HAi— 1 An ||p • 




Applying inequalities (49) to (53), there are positive constants C\, C[, C' 2 , C 3 such that for 
all n > 1 , 


E 

1 

1 

71 

to 

< Cie"^ nl “ + ^ + C 3 sup E 

& 

1 

71 

_ 1 



n “ E(n/2)+l<k<n-l 

L J 


□ 


Proof of Lemma 5.f. Let us define W n := V n - T m - (VG m „(l4) - V(?m„(r m )) and 

use decomposition (33), 

ll^n+l rVnll p = || W n ||p + 7 n ||^n+l IIF 7n ll r n||,F + 27 n (£n+l) V n — r m ) F + 27 n (£ n +l, VCrn n {Vn))p 
— ^7 n { r niV7 Gm n (V n ) — V Gm„ (T m )) p — 27 n (r n , ~ r m ) ^ . 

Since |£ n+ i || F < 2, ||r n || F < 2 and the fact that for all h G H, V £ S{H), V/ l G(V) < 1, 
we get with an application of Cauchy-Schwarz’s inequality 

IIK+ 1 ~~ r m ||^ < ||W n || F + 27 ^, (£ n +i, V n — T m )p + 27 n ||r n ||^ IIK — r m || f + 2077 . 


Thus, since (£ n ) is a sequence of martingale differences adapted to the filtration (J~ n ), and 
since ||VL n ||^ < (l + C 2 c 2 ) \\V n — T m ||^ (this inequality follows from Proposition 
from the fact that for all h £ H, Gh is a convex application), 


B.l 


and 


E 


IIK+i - r 


4 

m|| f 


< E 


I W, 


n\\F 


+ 2y n E 


+ 40 (1 + C 2 c 2 ) 7 2 E 


+ 4 7 2 E 


ll r n ||F ll^n ||F ll^n L m ||p 

V n - T " 2 
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m \\f 

(Cn+i) V n — V m Yp + 4007(( + 40y^E 


r n 11 F II Vn V m 11 p 


+ 47 2 E \\r n \\ 2 F \\V n -Tm\\ 2 F 
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Since ||£ n +i|| F < 2 and < 2, applying Cauchy-Schwarz’s inequality, there are positive 

constants C [, C' 2 such that for all n > 1 , 


E 


\\v n+l -r m f F 


< E 


llf^nlli? + 27 nE ||r n ||^ ||W n ||^ \\v n 


C\ C'n ^ 


n 3a n 2a 


\\V n -T m \\ 2 F 

(54) 


We now bound the two first terms at the right-hand side of inequality (54) 

Step 1: bounding E ||W n ||^ . 

“ ' r ' - /o 


Since VGm„(K)-VG ffi „(r m ) = /(’ V^ n G (r m + t (V n - r m )) (14 - T m ) dt, applying Propo¬ 


sition B.l[ one can check that 


||w n || 2 = \\V n - T m \\ 2 F - 2 7n (14 - r TO , VG mn (V n ) - VG mn (T m )} H + 7 2 ||VG^(K) - VG mn (T m )\\ 2 F 
< (i + c 2 7 2 ) ||44 — r m ||jp — 2 7n (14 — r m ,VGW n (! 4 )- VGm n (r m ))^. 

Since for all h G H, Gh is a convex application, ||H4||^ < (l + c 2 4 2 ) || 14 — r m ||^. Let p' 
be a positive integer. We now introduce the sequence of events (4 n y) ?ig ^ defined for all 
n > 1 by 


A n pi : = jw G fl, || 14 (w) — I 4 J 4 < n~^~, and ||m. n (o;) — ?n|| < e| 


(55) 


with e defined in Proposition B.l For the sake of simplicity, we consider that e' defined in 


lo 

> II V n ~ r m ||^ 1 {||y n -r m || F <e'}- L 4.,p' 


Proposition B.l verifies e' < 1. Applying Proposition B.l let 

Bn : = (VGm„(14) — VGm n (Pm))14 — ^m) p ^-A n p i l{||y n _r m || F <e'} 

= L (44 _ r m ) , 14 — r m ) F dt 

f '\W p , (56) 

In the same way, since Gm n is convex, let 

B n '■ = (VGjn„ (14) — VG^ n (r m ), 14 ~ r m )p lA n y l|||y ri _p m || j _ i>e /| 

= f (^Tn„ (r'rn + ^ (14 — fm)) (14 “ I’m) ,14 “ f m ) l{||y n _r m || F >e'} ^ A np ,dt 


> 


s: 


Vn “I’m || p 


(^m„ (^m 4 ^ (^n T m )) (14 Tm) , 14 Fm) 1| y V n -V m \\ F >e'"\ ^ A n,p' ^ 


Applying Proposition B.l 
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II Vn— r m || f ± 11 2 

-C m || Vji 1 


F 1 {||b l -r m || F >e'} 1 ^n.p'^ 


> 


c c m 
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I v n L m 
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If 1 {lW-r m || F >e'} 1 4 1 ,p' 


> 
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-n p ||14 PmllF l{||y„_ rm || F> e'}l^„y 


(57) 
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/ _ 1 — Q= 

There is a rank n p , such that for all n > n ' p ,, we have p < ^c m . Thus, applying 

inequalities ([561) and (|57|), for all n > n',, 


C Cm 


nil F W P ' ~^r^ n P ' ) - r -H F 1 4y 

Thus, there are a positive constant cy and a rank n p > such that for all n > n p i, 

2 


E 


nil F lA n, P ' 


< 1 - 


€ Cm 


_ 1 — OL 

7 n n pl 1 E 


II ^n r m || F lA , 


< (1 — 2c p ij n n p' JE ||14 -T m || F 


(58) 


Now, we must get an upper bound for E || W n ||p 1 a c , ■ Since || W n |||i < (l + c 2 C 2 ) \\V n — T m ||^ 
and since there is a positive constant cq such that for all n > 1, 


ll^n — r m ||^ < ||ki — r m ||j, + 7fc < 


S con 


,1—a 


fc=l 


we have 


E 


nllir 


U* , <(i + c 2 c 2 ) 2 e \\ v n - r m \\ 4 F i A c 

n.v \ i / n.v 


< (1 + c 2 C 2 f 4 4 “ 4 T [A c n>p ,\ 

< (l + c 2 C 2 ) 2 CQn 4 ~ 4a (p [||m n - m\\ > e] + P \\V n - T m || F > n~^~ 


Applying Markov’s inequality, Theorem 4.2 in Godichon-Baggioni (J2016) and Lemma 5.2, 

„2^-y2\2 44—4 a 


E 


n\\p 1 j 4 c 
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c 2p" 
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< 
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(1 + c^C 2 ) Con 4_4Q_p + (1 + c 2 C 2 ) c 4 M ( 


2.-y 2\ 2 „4 Ji/T 4-4a-2 9 Sg' 


n 


Taking p" > 4 — a and q > p' 2 (i- a ) 
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n\\F 


1 A c , 


= O 


1 

3a 


n 


(59) 


Thus, applying inequalities (58) and (59), there are positive constants cy, C\ y and a rank 

C hp' 


n p / such that for all n > n p /, 
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l|W n || 


< 1 - 2c p f'y n n ~P~ E 


ll^n-r. 


4 

m ||_p 


+ 


rv 


3 a 


(60) 


Step 2: bounding 2y n E ||r n || F ||W„||^ \\V n - T m || F 
Since ||W„||^ < (l + c 2 C 2 ) \\V n -T m \\ 2 F , applying Lemma 

D n : = 2y n E ||7Vi||i? || W n \\p \\V n — T m ||p 

< 2 (l + c 2 CP) y n E \\r n \\p\\V n -r m \\ 3 p 

< — (l + c 2 C 2 ) 2 7nn^E \\r n \\ 2 p\\V n -V m \\ 2 p 


E.l 
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Since ||r n || F < ^\/C + C-y/||r m ||^ \\m n — m\\ F and applying Theorem 4.2 in 


Baggioni (2016), 


Godichon- 
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D n < -4- (l + c 2 C 2 ) 4 ( v/C + C^\\T m \\ F j ln n p' E 

/ ,_\ 4 


3^ 


— m 


+ <y 7n n p' E 
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I v n x m 
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_ 1 —a 

= <y 7n n p' E 
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(61) 


Step 3: Conclusion. 

Applying inequalities (54), (60) and (61), there are a rank n p > and positive constants 

C i C 2 ,p' 


Cpi, Cl,/, C 2 y, C- 3 ,p' such that for all n > n p /, 
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|tn+l T m || F 


< 1 - Cp'jnn p' E 
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■E 
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□ 


E Some technical inequalities 


First, the following lemma recalls some well-known inequalities. 

Lemma E.l. Let a,b,c be positive constants. Then, 

a 2 b 2 c 

atl -Tc + T’ 

c a 2 
a ^ — T —. 

~ 2 2 c 

Moreover, let k,p be positive integers and a \, ...,a p be positive constants. Then, 

k 


Y a > I ^p k l Y 


a „• 


.1=1 


j=i 


The following lemma gives the asymptotic behavior for some specific sequences of de¬ 
scent steps. 

Lemma E.2. Let a, (3 be non-negative constants such that 0 < a < 1, and (u n ), (v n ) be 
two sequences defined for all n > 1 by 

c u c v 

u n ■ = ~~Z j v n - = ~S i 

n a n p 

with c u ,c v > 0. Thus, there is a positive constant Co such that for all n> 1, 

E(n/ 2) 

Y e“ E? = feUj W = o(e- C0?ll ' a ), (62) 

k =1 

n 

Y e~^i= kUi u k v k = O (v n ), (63) 

k=E(n/ 2)+l 

where E(.) is the integer part function. 
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Proof of Lemma \E.2\ We first prove inequality (|62|). For all n > 1, 
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E(n/2) E(n/ 2) 
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> — n 1 
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Thus, 
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We now prove inequality ( |63[ ). With the help of an integral test for convergence, 
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Thus, 
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k=E(n/ 2)+l fc=.E(n/2)H-l 

With the help of an integral test for convergence, there is a rank n UfV (for sake of simplicity, 
we consider that n U)V = 1 ) such that for all n > n U}V , 

/»n+l 


Y e kl - a k~ a ~P < 

k=E(n/ 2)+l 


g * 1 a t~ a ~Pdt 


< 


>E(n/ 2)+l 

1 




1 — a 

= efn+lj'—fn+l)"^ + Q 


+ /3 


f 

JE\ 


^t-^dt 


l£(n/2)+l ' JE( n / 2 )+l 

/ /-n+1 \ 

I e* a t~ a ~ p dt , 

£’(n/2)+l / 


since a < 1. Thus, 
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As a conclusion, we have 
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Y e-Z?=k«i Uk v k = O (e-( n+1 ) 1_ “ +nl_a « n ) 

/c=.E(n/2)+l 

= o (v n ). 
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